Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

conflicts: escape conflict markers by making them longer #4969

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

scott2000
Copy link
Collaborator

@scott2000 scott2000 commented Nov 25, 2024

Related to #3975 and #4088.

Checklist

If applicable:

  • I have updated CHANGELOG.md
  • I have updated the documentation (README.md, docs/, demos/)
  • I have updated the config schema (cli/src/config-schema.json)
  • I have added tests to cover my changes

@scott2000 scott2000 force-pushed the escape-conflict-markers branch 4 times, most recently from c547024 to d62d14c Compare November 25, 2024 17:29
@joyously
Copy link

Perhaps this should be tested by correctly parsing its own source code.

@scott2000
Copy link
Collaborator Author

Perhaps this should be tested by correctly parsing its own source code.

Surprisingly, the source code for conflicts.rs doesn't actually contain any lines that look like conflict markers! The only two files that have lines which can be confused for conflict markers currently are tutorial.md and conflicts.md it looks like.

@scott2000 scott2000 force-pushed the escape-conflict-markers branch from d62d14c to 6446b3d Compare November 26, 2024 15:35
@scott2000 scott2000 marked this pull request as ready for review November 26, 2024 15:46
@scott2000 scott2000 force-pushed the escape-conflict-markers branch from 6446b3d to 9f0be3b Compare November 26, 2024 20:29
lib/src/conflicts.rs Outdated Show resolved Hide resolved
lib/src/conflicts.rs Outdated Show resolved Hide resolved
lib/src/conflicts.rs Outdated Show resolved Hide resolved
lib/src/conflicts.rs Outdated Show resolved Hide resolved
debug_assert!(expected_len >= MIN_CONFLICT_MARKER_LEN);

parse_conflict_marker_any_len(line)
.filter(|marker| marker.len == expected_len)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to accept marker.len >= expected_len?

Suppose working-copy content was materialized with N = 10, and user removed a literal "<<<<<<<<<" (N = 9) without resolving any other conflicts, the new expected_len would drop. iirc, the working-copy content isn't updated to use the new expected_len.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I had originally, but I think it's too permissive actually. If we just had a minimum, and didn't enforce an exact number, then this could happen:

  1. Conflict materialized with length 12 markers since file contains length 8 marker
  2. User deletes length 8 marker
  3. Conflict is parsed expecting length >=12 markers successfully
  4. User adds length 8 marker back, since they made a mistake while editing
  5. Conflict is parsed expecting length >=7 markers now, so both the length 8 and length 12 markers are parsed, even though the length 8 marker isn't a real marker

To solve this, I made it so that only the longest length markers in a file are parsed. So the example would look like this:

  1. Conflict materialized with length 12 markers since file contains length 8 marker
  2. User deletes length 8 marker
  3. Conflict is parsed expecting length 12 markers successfully
  4. User adds length 8 marker back, since they made a mistake while editing
  5. Conflict is parsed expecting length 12 markers (since 12 is the longest marker present, and 12 >= 7), so the length 8 markers are correctly ignored

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, we can't know whether the re-added marker is a literal or conflict, right? My feeling is that it's safer to assume that a parsable marker is a conflict rather than invalidating any other conflicts by single addition of marker-looking line.

If we prefer stricter behavior, we'll probably need to store the materialized marker length to TreeState on checkout.

Copy link
Collaborator Author

@scott2000 scott2000 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point. Sorry for the long response; I was just trying to think through all the reasonable options there might be and summarize them. I think these are the variables we can use for setting a bound:

  • R = length of real conflict markers in original materialization (would need to be stored in TreeState)
  • N = length of longest fake conflict markers in last snapshot's Merge<FileId>
  • K = increment added to longest fake marker to get materialized marker length
  • C = length of longest conflict marker (real or fake) in the raw file contents being snapshotted

I think it's important that we have a simple rule that the user can follow while editing a file to ensure that the real conflict markers will continue to be parsed successfully. These are the 4 options I can think of, along with the expressions that we could use to determine if a marker is parsable:

Possible options

Option 1: "Don't add any new fake markers which are longer than any fake markers in the last snapshotted version of the file"

Either len >= max(7, N+K) or len >= max(7, N+1) would work here.

Option 2: "Don't add any new fake markers which are longer than any fake markers in the original materialized file"

I think len == max(7, N+K, C) matches this rule, which is what I currently have implemented. The main difference with this and option 1 is that the user is allowed to add back fake conflict markers that they removed, and they still won't be parsed.

Option 3: "Don't add any new fake markers with length greater than or equal to the markers that jj added"

I think len >= R is the obvious implementation of this rule. It does require adding it to the TreeState though.

I think len == max(7, N+1, C) would also fall into this category, since the user could still add any fake markers that are shorter than the original materialized markers, since that would keep the original materialized markers as being the longest in the file, so it would continue parsing them successfully. This is different from len == max(7, N+K, C), because that rule would cause snapshotting to fail the next time after adding any fake markers of length N+1 through N+K-1, since that would result in the real conflict markers becoming considered shorter than allowed in the following snapshot.

Option 4: "Don't add any new fake markers with the exact same length as the markers that jj added"

This is probably the easiest rule to follow for the user, and it is what we currently have in jj 0.23.0, except that the materialized length is always 7. This could be achieved with len == R, which also requires adding it to the TreeState.


If we don't store the materialized marker length in the TreeState, then I believe adding a fake marker which is longer than the real materialized markers will always cause the real markers to not be parsed correctly. With some implementations it might fail to parse in the current snapshot (because it picked the wrong markers to parse), or with other implementations it might parse successfully in the current snapshot, but then fail in the next snapshot (because the inferred minimum conflict marker length increased as a result of there being new fake conflict markers added), or it could parse successfully but just have the wrong content (since some fake markers were treated as real ones and happened to form a valid conflict).

With that in mind, I feel like these are the 3 best options, each with different tradeoffs (I think this is basically what you were saying, with the first and last options being the safest choices):

  • len >= max(7, N+1), since it is the simplest to implement, and it handles all of the common cases reasonably well.
  • len == max(7, N+1, C), since it allows adding any fake markers that are shorter than the real markers, which the user might expect to work.
  • len == R since it relies on the fewest assumptions (but it requires storing marker length in the TreeState).

I think I'll look into adding it to the TreeState to see how difficult it would be. If it's a small change, then probably that would be the best option?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll look into adding it to the TreeState to see how difficult it would be. If it's a small change, then probably that would be the best option?

It wouldn't be difficult, but I'm not pretty sure if that's better because we're adding a new persistent state invisible to user. (BTW, the materialized conflict style could be stored in the TreeState if supported styles had ambiguous markers.)

@ilyagr, @martinvonz, any comments?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • R = length of real conflict markers in original materialization (would need to be stored in TreeState)

Do we need to store it in the TreeState? Can we not get it from the previous content in the tree object?

Do you mean getting it from the tree of the working-copy parent? It might work, but I'm not sure.

FWIW, I think it makes sense to trust TreeState here because the materialization parameter is kinda working-copy state, not a state of the committed tree.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR to first implement parsing if >= materialized_marker_len, and then I added conflict_marker_len to the file state in another commit to see how it could work. If we decide not to add it to the file state, I can always drop the last commit from the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean getting it from the tree of the working-copy parent? It might work, but I'm not sure.

I meant getting it from the previous snapshot. But it seems like it would not handle the case in test_update_from_content_malformed_conflict anyway (unless we allowed the snapshotting process to make the conflict markers in the working copy longer, but that seems like a really bad idea).

I was thinking that it's good if TreeState is just a cache that can we can discard without impact other than in the corner case where a file has been modified in a way that the cache prevents us from detecting (currently if the mtime and size is preserved but e.g. inode is changed).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that it's good if TreeState is just a cache that can we can discard without impact other than in the corner case where a file has been modified in a way that the cache prevents us from detecting (currently if the mtime and size is preserved but e.g. inode is changed).

Currently, I have it set to default to parsing length 7 conflict markers if there's no materialized conflict data in the TreeState, so this should still work in many cases since it will parse all conflict markers of length >= 7 if the TreeState is reset, and the conflict markers should usually be length 7 anyways.

If we want to make it possible to reset the TreeState in more cases, we could have the conflict marker length default to choose_materialized_conflict_marker_len(current_tree_merge) instead maybe? There is a tradeoff though, because doing this could break already-materialized conflicts from older versions of jj if a user upgrades and the new materialized markers are longer than the actual markers in the file (which would be length 7 always in older versions).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to keep it simple for now.

lib/src/conflicts.rs Outdated Show resolved Hide resolved
lib/src/conflicts.rs Outdated Show resolved Hide resolved
// considered longer than necessary the next time the file is snapshotted.
let expected_marker_len = materialized_marker_len
.unwrap_or_else(|| choose_conflict_marker_len(&merge_hunk))
.max(max_content_marker_len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we trust the marker length of the new content? Doesn't it mean any existing conflicts can be invalidated if user added long <<<<<<<<<<<<< literal by mistake?

Copy link
Collaborator Author

@scott2000 scott2000 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, they would be invalidated if the user adds a marker with length greater than or equal to the materialized marker length. I'm not sure we can work around that though, since there will always be some length of marker that's invalid for them to add.

My goal was to make it so that if a file initially had fake markers of length N when it was materialized, then the user can always add/remove as many markers as they want with length <= N without breaking parsing, even if multiple snapshot operations happen during the process. I think it makes sense that adding markers of length > N could interfere with conflict resolution, since it makes it ambiguous whether the marker is a real one or not.

lib/src/conflicts.rs Show resolved Hide resolved
@scott2000 scott2000 force-pushed the escape-conflict-markers branch from 9f0be3b to cb50ba3 Compare December 3, 2024 01:25
Copy link
Member

@martinvonz martinvonz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I thought I had already sent these comments last week

/// Represents a kind of conflict marker which can be materialized and parsed.
#[derive(Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum ConflictMarkerKind {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How about ConflictMarkerSeparator or ConflictMarkerSeparatorChar since it just represents individual separators rather than the whole marker?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I feel about "separator". I personally have always thought of each individual line like "<<<<<<<" as being markers, rather than just being a part of a marker. Maybe ConflictMarkerLine and ConflictMarkerLineChar would be clearer? I don't feel strongly though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ConflictMarkerChar, LineChar, LineKind are good, if the original concern was that ConflictMarkerKind sounds like a ConflictMarkerStyle.

debug_assert!(expected_len >= MIN_CONFLICT_MARKER_LEN);

parse_conflict_marker_any_len(line)
.filter(|marker| marker.len == expected_len)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point about the invisible state. I think that's enough to convince me that one of the options is better. I'm leaning towards the first choice because it's simplest. But it's a good point about in the second option about presence of markers of different lengths. These would be real markers or markers that we would at least interpret as real. Maybe another option is to use option 1 but additionally require that all markers have the same length, or otherwise interpret the file as resolved. I don't strongly which of these three options we go with.

By the way, I think background snapshotting makes Scott's example at the start of this thread more important because you might save and then undo and save again and get two snapshots, which would then get an additional real conflict. OTOH, background snapshotting already makes it so a save while your editing conflict markers often results the file getting interpreted as resolved instead, so we're not making it much worse by adding this new risk of reinterpreting markers.

@scott2000 scott2000 force-pushed the escape-conflict-markers branch 5 times, most recently from 8b28066 to 5299c8f Compare December 7, 2024 15:41
lib/src/conflicts.rs Outdated Show resolved Hide resolved
lib/tests/test_conflicts.rs Outdated Show resolved Hide resolved
"};
// On the first snapshot, it will parse as a conflict containing conflict
// markers as text
let new_conflict = parse(&conflict, new_conflict_contents.as_bytes());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. This one is scary, and would probably provide a worse UX than the current state.

We'll have to either remember the materialized marker length or parse unpaired markers into some form of conflict.

cli/src/diff_util.rs Outdated Show resolved Hide resolved
/// Represents a kind of conflict marker which can be materialized and parsed.
#[derive(Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum ConflictMarkerKind {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ConflictMarkerChar, LineChar, LineKind are good, if the original concern was that ConflictMarkerKind sounds like a ConflictMarkerStyle.

@scott2000 scott2000 force-pushed the escape-conflict-markers branch from 5299c8f to 82cad26 Compare December 8, 2024 16:22
lib/tests/test_conflicts.rs Outdated Show resolved Hide resolved
debug_assert!(expected_len >= MIN_CONFLICT_MARKER_LEN);

parse_conflict_marker_any_len(line)
.filter(|marker| marker.len == expected_len)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean getting it from the tree of the working-copy parent? It might work, but I'm not sure.

I meant getting it from the previous snapshot. But it seems like it would not handle the case in test_update_from_content_malformed_conflict anyway (unless we allowed the snapshotting process to make the conflict markers in the working copy longer, but that seems like a really bad idea).

I was thinking that it's good if TreeState is just a cache that can we can discard without impact other than in the corner case where a file has been modified in a way that the cache prevents us from detecting (currently if the mtime and size is preserved but e.g. inode is changed).

lib/tests/test_conflicts.rs Show resolved Hide resolved
@scott2000 scott2000 force-pushed the escape-conflict-markers branch 4 times, most recently from cf0c834 to b99285a Compare December 17, 2024 01:05
These changes make the code a bit more readable, and they will make it
easier to have conflict markers of different lengths in the next commit.
If a file contains lines which look like conflict markers, then we need
to make the real conflict markers longer so that the materialized
conflicts can be parsed unambiguously.

When parsing the conflict, we require that the conflict markers are at
least as long as the materialized conflict markers based on the current
tree. This can lead to some unintuitive edge cases which will be solved
in the next commit.

For instance, if we have a file explaining the differences between
Jujutsu's conflict markers and Git's conflict markers, it could produce
a conflict with long markers like this:

```
<<<<<<<<<<< Conflict 1 of 1
%%%%%%%%%%% Changes from base to side jj-vcs#1
 Jujutsu uses different conflict markers than Git, which just shows the
-sides of a conflict without a diff.
+sides of a conflict without a diff:
+
+<<<<<<<
+left
+|||||||
+base
+=======
+right
+>>>>>>>
+++++++++++ Contents of side jj-vcs#2
Jujutsu uses different conflict markers than Git:

<<<<<<<
%%%%%%%
-base
+left
+++++++
right
>>>>>>>
>>>>>>>>>>> Conflict 1 of 1 ends
```

We should support these options for "git" conflict marker style as well,
since Git actually does support producing longer conflict markers in
some cases through .gitattributes:

https://git-scm.com/docs/gitattributes#_conflict_marker_size

We may also want to support passing the conflict marker length to merge
tools as well in the future, since Git supports a "%L" parameter to pass
the conflict marker length to merge drivers:

https://git-scm.com/docs/gitattributes#_defining_a_custom_merge_driver
Storing the conflict marker length in the working copy makes conflict
parsing more consistent, and it allows us to parse valid conflict hunks
even if the user left some invalid conflict markers in the file while
resolving the conflicts.
@scott2000 scott2000 force-pushed the escape-conflict-markers branch from b99285a to f2d6c28 Compare December 17, 2024 21:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants